In this practice we analyze the correlation between certain variables and payment fraud events.
First, we will analyze and select the data and make the necessary changes to it. Afterwards, we will build several prediction models in order to select the best one. Once this selection is done, the explainability of the results will be studied.
The data was provided by the professors of the course.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')
#Functions document
import functions as fx
#Read the data and convert ',' decimal separators to '.' in numeric columns
pd_fraud = (pd.read_csv("../data/Original_dataset_payments_fraud.csv", delimiter=";")
.assign(**{'connection_time': lambda df: df['connection_time'].str.replace(',', '.').astype(float)})
)
pd_fraud.head()
| | step | type | amount | gender | device | connection_time | nameOrig | race | oldbalanceOrg | age | newbalanceOrig | zone | user_number | nameDest | user_connections | security_alert | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | man | mac | 0.140039 | C1231006815 | black | 170136.0 | 85 | 160296.36 | capital | 138 | M1979787155 | 5 | 1 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | woman | mac | 0.496890 | C1666544295 | asian | 21249.0 | 57 | 19384.72 | country | 909 | M2044282225 | 1 | 0 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | man | pc | 0.781150 | C1305486145 | asian | 181.0 | 66 | 0.00 | capital | 2569 | C553264065 | 10 | 0 | 0.0 | 0.0 | 1 |
| 3 | 1 | CASH_OUT | 181.00 | man | mac | 0.565068 | C840083671 | black | 181.0 | 31 | 0.00 | country | 1787 | C38997010 | 3 | 0 | 21182.0 | 0.0 | 1 |
| 4 | 1 | PAYMENT | 11668.14 | unknow | mac | 0.517114 | C2048537720 | black | 41554.0 | 90 | 29885.86 | country | 3997 | M1230701703 | 8 | 0 | 0.0 | 0.0 | 0 |
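As a side note, `pandas.read_csv` can parse comma decimals directly through its `decimal` argument, which avoids the string replacement step above. A minimal sketch on invented sample values:

```python
import io
import pandas as pd

# Synthetic sample mimicking the ';'-delimited file with ',' as decimal mark
raw = "connection_time;amount\n0,140039;9839\n0,496890;1864\n"

# decimal=',' makes the parser treat commas as the decimal separator,
# so connection_time arrives directly as float64
df = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",")
print(df["connection_time"].dtype)
```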
We drop the gender and race variables so as not to discriminate against anyone. Furthermore, nameOrig and nameDest are dropped too, as these variables only identify the user and are not relevant to this study.
pd_fraud = pd_fraud.drop(["gender", "race", "nameOrig", "nameDest"], axis=1)
#Duplicates
print(pd_fraud.shape, pd_fraud.drop_duplicates().shape)
(1048575, 15) (1048575, 15)
There are no duplicates in this dataset.
#Split the target into fraud and non-fraud counts and percentages
pd_plot_fraud_status = pd_fraud['isFraud']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_fraud_status_conteo = pd_fraud['isFraud'].value_counts().reset_index()
pd_plot_fraud_status_pc = pd.merge(pd_plot_fraud_status, pd_plot_fraud_status_conteo,
on=['index'], how='inner')
pd_plot_fraud_status_pc
| | index | percent | isFraud |
|---|---|---|---|
| 0 | 0 | 99.89109 | 1047433 |
| 1 | 1 | 0.10891 | 1142 |
#Plot the results
fig = px.histogram(pd_plot_fraud_status_pc, x="index", y=['percent'])
fig.show()
It can be seen that the target is highly unbalanced, with 99.89% of the transactions being non-fraudulent and only 0.11% fraudulent.
#Count the null values per column
pd_series_null_columns = pd_fraud.isnull().sum().sort_values(ascending=False)
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_columnas
| | nulos_columnas |
|---|---|
| device | 104580 |
| zone | 104414 |
| step | 0 |
| type | 0 |
| amount | 0 |
| connection_time | 0 |
| oldbalanceOrg | 0 |
| age | 0 |
| newbalanceOrig | 0 |
| user_number | 0 |
| user_connections | 0 |
| security_alert | 0 |
| oldbalanceDest | 0 |
| newbalanceDest | 0 |
| isFraud | 0 |
The device and zone columns each have around 10% null values. This share is significant, but we decide to keep these columns for the analysis, since the absence of an answer may itself provide useful information when predicting.
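One way to preserve that "missingness" signal is to make NaN an explicit category before encoding, so it later gets its own one-hot column. A small sketch on invented `device` values (the `'missing'` label is an assumption, not from the dataset):

```python
import numpy as np
import pandas as pd

devices = pd.Series(["mac", "pc", np.nan, "iphone", np.nan], name="device")

# Replace nulls with an explicit level so the information survives encoding
devices_filled = devices.fillna("missing").astype("category")
print(devices_filled.value_counts())
```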
#Select the categorical and numerical variables
list_cat_var, other = fx.dame_variables_categoricas(dataset=pd_fraud)
pd_fraud[list_cat_var] = pd_fraud[list_cat_var].astype("category")
list_numeric_var = list(pd_fraud.select_dtypes(include=('int','float')).columns)
pd_fraud[list_numeric_var] = pd_fraud[list_numeric_var]#.astype(float)
pd_fraud.dtypes
step                 int64
type              category
amount             float64
device            category
connection_time    float64
oldbalanceOrg      float64
age                  int64
newbalanceOrig     float64
zone              category
user_number          int64
user_connections     int64
security_alert    category
oldbalanceDest     float64
newbalanceDest     float64
isFraud           category
dtype: object
pd_fraud[list_cat_var].dtypes
type              category
device            category
zone              category
security_alert    category
isFraud           category
dtype: object
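`fx.dame_variables_categoricas` comes from the accompanying functions module and its code is not shown here; a common dtype-based heuristic it could implement looks roughly like this (an assumption about the helper, not its actual implementation):

```python
import pandas as pd

def split_categorical(df: pd.DataFrame):
    """Split column names into (categorical, other) by dtype."""
    cat, other = [], []
    for col in df.columns:
        if str(df[col].dtype) in ("object", "category"):
            cat.append(col)
        else:
            other.append(col)
    return cat, other

toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER"], "amount": [9839.64, 181.0]})
cat_cols, other_cols = split_categorical(toy)
print(cat_cols, other_cols)
```

The real helper evidently also flags low-cardinality integer columns such as `security_alert` and `isFraud`, which a dtype-only rule like this one would miss.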
#Plot the histogram for variable device
fx.histogram(pd_fraud, 'device')
The most used device is the Mac, followed by the PC and the iPhone.
#Plot the histogram for variable type
fx.histogram(pd_fraud, 'type')
As can be seen in the graph above, the most frequent transaction types are payments and cash-outs. Debit is very rare compared with the other options.
#Plot the boxplot for variable amount
fx.box_plot(pd_fraud, 'amount')
The transaction amount is much higher in those cases where payment fraud occurred.
#Plot the boxplot for variable age
fx.box_plot(pd_fraud, 'age')
At first sight, the age boxplots look very similar whether or not fraud occurred.
#Use the get_deviation_of_mean_perc function to calculate outliers
fx.get_deviation_of_mean_perc(pd_fraud, list_numeric_var, target='isFraud', multiplier=3)
| | 0.0 | 1.0 | variable | sum_outlier_values | porcentaje_sum_null_values |
|---|---|---|---|---|---|
| 0 | 0.986447 | 0.013553 | step | 18004 | 0.017170 |
| 1 | 0.981773 | 0.018227 | amount | 19422 | 0.018522 |
| 2 | 0.999311 | 0.000689 | oldbalanceOrg | 26140 | 0.024929 |
| 3 | 1.000000 | 0.000000 | newbalanceOrig | 26282 | 0.025064 |
| 4 | 0.999427 | 0.000573 | oldbalanceDest | 22685 | 0.021634 |
| 5 | 0.998513 | 0.001487 | newbalanceDest | 22862 | 0.021803 |
As the results show, outliers do exist. However, since the dataset is very unbalanced, the data will not be modified, in case these outliers help explain the results.
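The exact rule applied by `fx.get_deviation_of_mean_perc` lives in the functions module; its name and `multiplier` argument suggest flagging values more than `multiplier` standard deviations from the mean, which can be sketched as follows (an assumption about the helper, not its actual code):

```python
import pandas as pd

def flag_outliers(series: pd.Series, multiplier: float = 3.0) -> pd.Series:
    """Boolean mask: True where a value lies outside mean +/- multiplier * std."""
    mean, std = series.mean(), series.std()
    return (series < mean - multiplier * std) | (series > mean + multiplier * std)

# 20 ordinary transactions plus one extreme amount (invented values)
amounts = pd.Series([100.0] * 20 + [10_000.0])
mask = flag_outliers(amounts, multiplier=3.0)
print(int(mask.sum()))  # → 1: only the extreme value is flagged
```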
#Plot the correlation matrix
fx.get_corr_matrix(dataset = pd_fraud[list_numeric_var],
metodo='pearson', size_figure=[10,8])
#Calculate the numeric results of the correlation
corr = pd_fraud[list_numeric_var].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1)  # keep only the strictly lower triangle so each pair appears once
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']>0.6]
| | level_0 | level_1 | correlation |
|---|---|---|---|
| 53 | newbalanceOrig | oldbalanceOrg | 0.999050 |
| 98 | newbalanceDest | oldbalanceDest | 0.978401 |
There is a high correlation between the new and old balance of the origin account, and the same holds for the destination. However, as the target is very unbalanced, we will not modify these variables for now.
#Select all the columns except the target variable
X_fraud = pd_fraud.drop('isFraud',axis=1)
#Select the target variable
y_fraud = pd_fraud['isFraud']
list_columns_cat = list(X_fraud.select_dtypes("category").columns)
#Convert the categorical variables with one-hot encoding
ohe = ce.OneHotEncoder(cols=list_columns_cat)
model = ohe.fit(X_fraud, y_fraud)
model
OneHotEncoder(cols=['type', 'device', 'zone', 'security_alert'])
X_fraud_t = model.transform(X_fraud, y_fraud)
len(list(X_fraud_t.columns))
25
X_fraud_t.dtypes.to_dict()
{'step': dtype('int64'),
'type_1': dtype('int64'),
'type_2': dtype('int64'),
'type_3': dtype('int64'),
'type_4': dtype('int64'),
'type_5': dtype('int64'),
'amount': dtype('float64'),
'device_1': dtype('int64'),
'device_2': dtype('int64'),
'device_3': dtype('int64'),
'device_4': dtype('int64'),
'connection_time': dtype('float64'),
'oldbalanceOrg': dtype('float64'),
'age': dtype('int64'),
'newbalanceOrig': dtype('float64'),
'zone_1': dtype('int64'),
'zone_2': dtype('int64'),
'zone_3': dtype('int64'),
'zone_4': dtype('int64'),
'user_number': dtype('int64'),
'user_connections': dtype('int64'),
'security_alert_1': dtype('int64'),
'security_alert_2': dtype('int64'),
'oldbalanceDest': dtype('float64'),
'newbalanceDest': dtype('float64')}
X_fraud_t.head()
| | step | type_1 | type_2 | type_3 | type_4 | type_5 | amount | device_1 | device_2 | device_3 | ... | zone_1 | zone_2 | zone_3 | zone_4 | user_number | user_connections | security_alert_1 | security_alert_2 | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 9839.64 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 138 | 5 | 1 | 0 | 0.0 | 0.0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1864.28 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 909 | 1 | 0 | 1 | 0.0 | 0.0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 181.00 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 2569 | 10 | 0 | 1 | 0.0 | 0.0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 181.00 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1787 | 3 | 0 | 1 | 21182.0 | 0.0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 11668.14 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 3997 | 8 | 0 | 1 | 0.0 | 0.0 |
5 rows × 25 columns
#20% of the data will be used as test data. Due to the class imbalance, we stratify on the target
X_train, X_test, y_train, y_test = train_test_split(X_fraud_t, y_fraud, stratify=y_fraud, test_size=0.2, random_state=1)
X_test['isFraud'] = y_test
X_train['isFraud'] = y_train
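With `stratify=y_fraud`, the 0.11% fraud rate is preserved almost exactly in both splits. A quick sketch on a synthetic unbalanced target shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with a 1% positive rate
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.2, random_state=1)

# Stratification puts exactly 2 of the 10 positives in the 20% test split
print(int(y_tr.sum()), int(y_te.sum()))
```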
#Check whether the split has similar results
X_train.describe().round(2)
| | step | type_1 | type_2 | type_3 | type_4 | type_5 | amount | device_1 | device_2 | device_3 | ... | zone_1 | zone_2 | zone_3 | zone_4 | user_number | user_connections | security_alert_1 | security_alert_2 | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | 838860.00 | ... | 838860.00 | 838860.00 | 838860.00 | 838860.0 | 838860.00 | 838860.00 | 838860.0 | 838860.0 | 838860.00 | 838860.00 |
| mean | 26.96 | 0.34 | 0.08 | 0.36 | 0.01 | 0.22 | 158843.80 | 0.35 | 0.30 | 0.25 | ... | 0.25 | 0.35 | 0.30 | 0.1 | 2530.30 | 5.50 | 0.1 | 0.9 | 977107.51 | 1113307.51 |
| std | 15.62 | 0.47 | 0.28 | 0.48 | 0.08 | 0.41 | 265086.90 | 0.48 | 0.46 | 0.43 | ... | 0.43 | 0.48 | 0.46 | 0.3 | 1425.54 | 2.87 | 0.3 | 0.3 | 2293281.74 | 2413068.46 |
| min | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 59.00 | 1.00 | 0.0 | 0.0 | 0.00 | 0.00 |
| 25% | 15.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12129.44 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1296.00 | 3.00 | 0.0 | 1.0 | 0.00 | 0.00 |
| 50% | 20.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 76315.05 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 2532.00 | 6.00 | 0.0 | 1.0 | 126301.18 | 218243.75 |
| 75% | 39.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 213818.56 | 1.00 | 1.00 | 0.00 | ... | 0.00 | 1.00 | 1.00 | 0.0 | 3763.00 | 8.00 | 0.0 | 1.0 | 916432.90 | 1150730.90 |
| max | 95.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 10000000.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.0 | 5000.00 | 10.00 | 1.0 | 1.0 | 42054659.73 | 42169156.09 |
8 rows × 25 columns
X_test.describe().round(2)
| | step | type_1 | type_2 | type_3 | type_4 | type_5 | amount | device_1 | device_2 | device_3 | ... | zone_1 | zone_2 | zone_3 | zone_4 | user_number | user_connections | security_alert_1 | security_alert_2 | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | 209715.00 | ... | 209715.00 | 209715.00 | 209715.00 | 209715.0 | 209715.00 | 209715.00 | 209715.0 | 209715.0 | 209715.00 | 209715.00 |
| mean | 27.01 | 0.34 | 0.08 | 0.36 | 0.01 | 0.22 | 157959.66 | 0.35 | 0.30 | 0.25 | ... | 0.25 | 0.35 | 0.30 | 0.1 | 2530.98 | 5.50 | 0.1 | 0.9 | 982369.78 | 1117736.29 |
| std | 15.64 | 0.47 | 0.28 | 0.48 | 0.08 | 0.41 | 264355.71 | 0.48 | 0.46 | 0.43 | ... | 0.43 | 0.48 | 0.46 | 0.3 | 1429.94 | 2.87 | 0.3 | 0.3 | 2310717.60 | 2430450.92 |
| min | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 59.00 | 1.00 | 0.0 | 0.0 | 0.00 | 0.00 |
| 25% | 15.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12236.51 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 1289.00 | 3.00 | 0.0 | 1.0 | 0.00 | 0.00 |
| 50% | 20.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 76509.09 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.0 | 2532.00 | 5.00 | 0.0 | 1.0 | 126703.03 | 218308.07 |
| 75% | 39.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 213526.00 | 1.00 | 1.00 | 0.00 | ... | 1.00 | 1.00 | 1.00 | 0.0 | 3771.00 | 8.00 | 0.0 | 1.0 | 914317.74 | 1147577.34 |
| max | 95.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 10000000.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.0 | 5000.00 | 10.00 | 1.0 | 1.0 | 41259370.36 | 42054659.73 |
8 rows × 25 columns
#Save the training and test splits to separate Parquet files
X_train.to_parquet("../data/training_data.parquet")
X_test.to_parquet("../data/test_data.parquet")